library(skimr)
library(tidyverse)
library(caret) # For featureplot, classification report
library(corrplot) # For correlation matrix
library(AppliedPredictiveModeling)
library(mice) # For data imputation
library(VIM) # For missing data visualization
library(gridExtra) # For grid plots
library(rpart) # For Decision Trees models
library(rpart.plot) # For Decision Tree Plots
library(randomForest) # For Random Forest models
library(randomForestExplainer) # For Random Forest Variable Importance Analysis
library(gbm) #Gradient BoostingThe penguin dataset is composed of 344 observations with 8 variables, 5 of which are numeric and 3 which are qualitative. The dataset is mostly complete with just a few observations with missing values that will need to be handled.
| Name | penguins |
| Number of rows | 344 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
The target variable of interest is the species of penguins, which are categorized into three groups: Adelie, Gentoo and Chinstrap penguins.
## [1] Adelie Gentoo Chinstrap
## Levels: Adelie Chinstrap Gentoo
From this plot, we can make a few key observations:
These island observations are valuable information in differentiating penguin species.
However, the sex of the penguins does not offer much information as the proportion is about even across all species. We can also note a few missing observations labeled as NA.
We noted from the data summary above that 11 observations were missing for the sex variable. There is also no reason to believe that the year the observation was taken would have any impact on the morphology of the penguins. We are not looking for any time series modeling. Therefore, we also drop year from our predictor variables. There are also two observations which are missing body measurements altogether, so these rows will be dropped altogether.
When looking at body measurements we see that Adelie and Chinstrap penguins largely overlap except for bill_length. This suggests that we might be able to use bill_depth, body_mass and flipper_length to differentiate the Gentoo penguins from the other species. However, the Adelie penguin stands out from the other others in bill_length
The scatterplot matrix below is another way to visualize the separation and overlap between classes for different combination of variables. We see that in general, Gentoo penguins standalone as a separate group. However, Adelie and Chinstrap penguins overlap in the comparison of bill_depth, flipper_length and body_mass.
We see on the univariate feature plots below that the data is aproximatelly normally distributed.
Taking a look at the correlation matrix below, we can make a few observations, notably that flipper_length is highly positively correlated with body_mass which makes sense given that larger penguins should have larger flippers. The other correlations are less obvious to interpret. Given that the dataset only contains a few predictors, we choose not to exclude any variables based on multicollinearity at this time.
The KNN algorithms requires minor data processing. Firstly, predictor values that are factors should be conversted to numeric. Secondly, because KNN uses distance between points to determine their classification, it is important for the points to be on the scaled appropriately. Here we pass the scale argument to the preProcess parameter of the training function to standardize each variable. The data is then split into training and testing sets 80%/20%. The test set contains 65 observations and the train set 268 observations.
# Processing
penguins_knn <- penguins
penguins_knn$sex <- as.numeric(penguins_knn$sex)-1 # recode as 1 or 0
# Data Partitioning
set.seed(622)
trainIndex <- createDataPartition(penguins_knn$species, p = .8, list = FALSE, times = 1)
knn_training <- penguins_knn[trainIndex,]
knn_testing <- penguins_knn[-trainIndex,]We performed 10-fold cross-validation in the training data to determine the optimal parameter k for our model. The resulting accuracy for each value of k is displayed and plotted below. The maximum accuracy is reached with values of k=3 and k=4 but the training procedure automatically chose k=4 as the best model. We gain a full percentage point in cross-validation accuracy on the training data using the tuned model over models with slightly more or fewer neighbors.
trControl <- trainControl(method = "cv",
number = 10)
knn.fit <- train(species ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
trControl = trControl,
preProcess = c("center","scale"),
metric = "Accuracy",
data = knn_training)## k-Nearest Neighbors
##
## 268 samples
## 5 predictor
## 3 classes: 'Adelie', 'Chinstrap', 'Gentoo'
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 241, 241, 240, 242, 241, 242, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9851852 0.9769336
## 2 0.9850427 0.9765099
## 3 0.9962963 0.9941558
## 4 0.9962963 0.9941558
## 5 0.9887464 0.9821064
## 6 0.9851750 0.9764839
## 7 0.9851750 0.9764839
## 8 0.9888787 0.9823281
## 9 0.9888787 0.9823281
## 10 0.9888787 0.9823281
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 4.
The evaluation of the tuned K-NN model on the testing data reveals that the model was able to classify species with perfect accuracy. However, it is important to note that 100% prediction accuracy is typically rare and that this model benefitted from fairly clean class separations and limited overlap in the original dataset.
knnPredict <- predict(knn.fit, newdata = knn_testing)
confusionMatrix(knnPredict, knn_testing$species)## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 29 0 0
## Chinstrap 0 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9448, 1)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.0 1.0000
## Specificity 1.0000 1.0 1.0000
## Pos Pred Value 1.0000 1.0 1.0000
## Neg Pred Value 1.0000 1.0 1.0000
## Prevalence 0.4462 0.2 0.3538
## Detection Rate 0.4462 0.2 0.3538
## Detection Prevalence 0.4462 0.2 0.3538
## Balanced Accuracy 1.0000 1.0 1.0000
A loan-issuance company with presence across urban, semi urban and rural areas wants to validate the eligibility of customers to be granted a loan. Using an online application form, customers enter a series of attributes such as Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate the eligibility process, multiple supervised classification approaches can be used. In our case, we explore tree-based methods starting from simple Decision Trees (DT). Ensemble methods such as Random Forests (RF) and Gradient Boosting (GB) are also used to improve on the classification accuracy of the simple DT.
The loan dataset is composed of 13 variables and 614 observations. The target or dependent variable is Loan_Status and contains ‘Y’ or ‘N’ as entries. Predicting whether a loan will be approved is a supervised binary classification problem.
Eight of the variables are factors and 5 are numeric. The dataset contains missing values recorded either as ‘NA’ or simply empty. Some columns are missing nearly 10% of observations. Imputation of the missing values will be a step in the data pre-processing. We also note that the Loan_ID variable is simply an index and holds no valuable information, making it safe for removal. The Credit_History variable is coded as numeric but it is a binary variable with two levels.
loan_raw <- read.csv('https://raw.githubusercontent.com/maelillien/data622/main/hw3/Loan_approval.csv', header = TRUE)
loan_raw <- loan_raw %>% mutate_if(is.character, factor)
loan <- loan_raw# replace blank values with NA to allow for proper calculation of the complete_rate column in the data summary
loan[loan==''] <- NA
skim(loan)| Name | loan |
| Number of rows | 614 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| factor | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Loan_ID | 0 | 1.00 | FALSE | 614 | LP0: 1, LP0: 1, LP0: 1, LP0: 1 |
| Gender | 13 | 0.98 | FALSE | 2 | Mal: 489, Fem: 112, emp: 0 |
| Married | 3 | 1.00 | FALSE | 2 | Yes: 398, No: 213, emp: 0 |
| Dependents | 15 | 0.98 | FALSE | 4 | 0: 345, 1: 102, 2: 101, 3+: 51 |
| Education | 0 | 1.00 | FALSE | 2 | Gra: 480, Not: 134 |
| Self_Employed | 32 | 0.95 | FALSE | 2 | No: 500, Yes: 82, emp: 0 |
| Property_Area | 0 | 1.00 | FALSE | 3 | Sem: 233, Urb: 202, Rur: 179 |
| Loan_Status | 0 | 1.00 | FALSE | 2 | Y: 422, N: 192 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ApplicantIncome | 0 | 1.00 | 5403.46 | 6109.04 | 150 | 2877.5 | 3812.5 | 5795.00 | 81000 | ▇▁▁▁▁ |
| CoapplicantIncome | 0 | 1.00 | 1621.25 | 2926.25 | 0 | 0.0 | 1188.5 | 2297.25 | 41667 | ▇▁▁▁▁ |
| LoanAmount | 22 | 0.96 | 146.41 | 85.59 | 9 | 100.0 | 128.0 | 168.00 | 700 | ▇▃▁▁▁ |
| Loan_Amount_Term | 14 | 0.98 | 342.00 | 65.12 | 12 | 360.0 | 360.0 | 360.00 | 480 | ▁▁▁▇▁ |
| Credit_History | 50 | 0.92 | 0.84 | 0.36 | 0 | 1.0 | 1.0 | 1.00 | 1 | ▂▁▁▁▇ |
The pre-processing steps are the following:
TotalIncome by summing applicant and coapplicant incomes. Typically loan issuers take into account the combined income of the applicant and guarantor.Loan_ID and the individual income variables that were just combined.Credit_History as a factor with 2 levels instead of a numeric variablesNote that the tree based methods employed in this exercise are not required to be coded as numeric or expanded as dummy variables. We can see from the data summary above that the remainder of the variables have the proper data type.
loan <- loan %>% mutate(TotalIncome = ApplicantIncome + CoapplicantIncome)
loan <- loan %>% select(-c('Loan_ID','ApplicantIncome','CoapplicantIncome'))
loan$Credit_History <- as.factor(loan$Credit_History)By examining the pattern plot of missing values below, we discern a slight pattern from variable to variable and observation to observation. We can consider a few scenarios to account for the missingness. For example, individuals who are married but separated might leave the ‘Married’ field blank. Individuals with non-binary gender identity would not choose ‘Male’ or ‘Female’ if the option to leave the field blank is a valid entry. However, it seems possible to account for the missing values based on the complete information of other variables. We might be able to predict that an individual is married or lives in a suburban area if they have a large number of dependents. Therefore we make the assumption that this dataset has missing values at random (MAR).
The simplest way of dealing with missing values is to conduct “complete case analysis” which involves dropping observations for which predictor values are missing. However, given the small number of observations (614) in this dataset, dropping observations in such fashion would result in significant data loss. Since the variable with the most missing values (nearly 8%), namely Credit_History, is suspected to the very predictive we will impute the missing values using various predictive models.
##
## Variables sorted by number of missings:
## Variable Count
## Credit_History 0.081433225
## Self_Employed 0.052117264
## LoanAmount 0.035830619
## Dependents 0.024429967
## Loan_Amount_Term 0.022801303
## Gender 0.021172638
## Married 0.004885993
## Education 0.000000000
## Property_Area 0.000000000
## Loan_Status 0.000000000
## TotalIncome 0.000000000
Given the assumption that this dataset has Missing At Random Values, we can use the MICE (Multivariate Imputation via Chained Equations) package to impute the missing values throughout the dataset. For each type of predictor variable we will use:
Below, we confirm that the imputing procedure was succesful and that the dataset no longer contains missing values.
##
## TRUE
## 614
By examining the target variables, we see that nearly 70% of all loans are approved.
We can make a few observations from boxplots of the numeric variables below:
Loan_Amount_Term only take a few discrete values representing various durations. The most common value by far, as indicated by the flat box is around 360 months meaning that the most common loan term is 30 years. This was fairly consistent across both outcomes of the dependent variable.LoanAmount does not greatly differ across the dependent variable. The interquartile range is slightly more compressed in the ‘Y’ category and there is a greater range of outliers values on the upper end of the range. The mean loan amounts were comparable for both outcomes.TotalIncome is fairly similar across the outcomes up to about $30,000 in total income. Interestingly, the observation with the highest total income recorded is the greatest outlier and was not issued a loan. Large skew is observed across both categories.From the bar plots of the categorical variables shwon below, we observe the following:
Credit_History: The majority of applicants had credit history; the large majority of which were approved for a loan. This is likely to be one of the most important factor in determining who gets a loan.Dependents: Individuals with 0 dependents form the majority of the cohort. Individual with fewer dependents may be less risk adverse and more willing to take on debt.Education: More individuals with graduate education applied for loans and a greater proportion of them received one in comparison to the “Not Graduate” counterparts.Gender: More than 3 times more males applied for loans than females. Both genders seem to be granted loans in the same proportion.Married: Married individuals applied for loans. This could be a consequence of needing to finance something like a home or a car which is more typical of married households.Property_Area: Inviduals living in semi-urban propety areas applied for the most number of loans but also had the greatest proportion of approved loans. Urban areas follow with with approximately 50% of approved load while rural areas has fewer applicants and a greater proportion of rejections.Self_Employed: Individuals who were not self-employed made up the large majority of the observations. This makes sense given that in general salaried employees greatly outnumber self-employed employees. Additionally, a self-employed individual may have less consistent streams of revenue and therefore might be less willing to take on debt.Interestingly, there is little difference in the relationship between loan amount and total income across the credit history category as seen by the nearly collinear regression lines. The slope was slightly higher for individuals with credit history which makes sense given that individuals with credit history are in general more likely to be granted loans. We can also observe that individuals on the low end of the total income axis and below the regression line generally had credit history. These are individuals with larger incomes but requesting less sizable loans.
The least squares lines across the dependent variable are also nearly collinear but the line representing individuals who received loans has a slightly greater slope suggesting that higher incomes unlock larger loans which is sensible and expected.
Other than credit history, total income (applicant income + coapplicant income) seems like the most logical basis for approving or denying a loan. The histograms below compare total income across all levels of the categorical variables. Total income is skewed to the right tail, with a few observations at the higher end of the log-transformed income scale. It makes sense to see more graduates at the tail end of income since their education should yield to higher paying jobs. The gender distribution is consistent with the discriminatory pay gap between men and women.
The data pre-processing section the structural form of the data. No additional processing is required for tree-based methods. We partition our data into training and testing sets in a 70%/30% proportion.
A Decision Tree is a type of supervised learning model where the data is recursively split into two or more sub-populations given a criteria. Each split is headed by a node, where the upper most node is called the root node, and nodes which cannot be split further are called terminal (leaf) nodes. All other nodes are considered internal nodes. Based on a given population of observations, the population is split into sub-populations with the criteria that each split separates the sub-population better than the previous split. This recursive splitting cycles through the different predictor variables in order to find leaf nodes that make the purest class separation or most accurate predictions.
The Decision Tree model can be used for regression and classification. In this case we use the model for classification of loan approvals.
The first step is to grow a baseline decision tree based on splits generating the most Information Gain, which is based on the decrease in entropy after a dataset is split on a given attribute. Entropy is a measure of node purity.
After growing a complete simple tree, we will need to decide if it requires pruning in order to reduce its complexity. Overly complex trees have high variance and do not predict well on new, unseen data. In our case, we use the Complexity Parameter set to a value of 0 as a measure of the required split improvement. The Cost Complexity parameter is slightly different from Information Gain and Gini index, but is conceptually similar. The parameter modulates the amount by which splitting a given node improved the relative error, in other words, it is the minimum improvement in the overall model needed at each node to justify the split.
loan_dt <- rpart(Loan_Status ~., data = training, method = "class",
parms = list(split="information"), control = rpart.control(cp=0))
loan_dt## n= 431
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 431 135 Y (0.31322506 0.68677494)
## 2) Credit_History=0 73 6 N (0.91780822 0.08219178) *
## 3) Credit_History=1 358 68 Y (0.18994413 0.81005587)
## 6) TotalIncome< 2457 8 2 N (0.75000000 0.25000000) *
## 7) TotalIncome>=2457 350 62 Y (0.17714286 0.82285714)
## 14) TotalIncome>=20199.5 11 5 N (0.54545455 0.45454545) *
## 15) TotalIncome< 20199.5 339 56 Y (0.16519174 0.83480826)
## 30) Loan_Amount_Term>=420 10 5 N (0.50000000 0.50000000) *
## 31) Loan_Amount_Term< 420 329 51 Y (0.15501520 0.84498480)
## 62) Property_Area=Rural 92 21 Y (0.22826087 0.77173913)
## 124) TotalIncome< 3606.5 9 4 N (0.55555556 0.44444444) *
## 125) TotalIncome>=3606.5 83 16 Y (0.19277108 0.80722892)
## 250) Dependents=1 7 3 N (0.57142857 0.42857143) *
## 251) Dependents=0,2,3+ 76 12 Y (0.15789474 0.84210526) *
## 63) Property_Area=Semiurban,Urban 237 30 Y (0.12658228 0.87341772)
## 126) TotalIncome>=3416 215 30 Y (0.13953488 0.86046512)
## 252) TotalIncome>=6577 77 15 Y (0.19480519 0.80519481)
## 504) TotalIncome< 6800 7 2 N (0.71428571 0.28571429) *
## 505) TotalIncome>=6800 70 10 Y (0.14285714 0.85714286) *
## 253) TotalIncome< 6577 138 15 Y (0.10869565 0.89130435) *
## 127) TotalIncome< 3416 22 0 Y (0.00000000 1.00000000) *
The tree fitted on the training data is displayed below. The simple baseline tree is 8 levels deep. Each node shows the predicted class, the probability of that class and the percentage of observations that fall into the given node. By studying the splits, we see that after rejecting an applicant based on the lack of credit history, the applicant has a 81% percent of being approved regardless of all the other variables. Factoring in TotalIncome which is an important feature of any loan application, only increased the percentage of approval by 1% to 82%. Further splitting by Loan Amount and Property Area only increased accuracy by 1% for each step.
To evaluate the simple tree model, we predict Loan_Status on the test set and calculate the performance metrics.
loan_dt_pred <- predict(loan_dt, testing, type = "class")
dt_cm <- confusionMatrix(loan_dt_pred, testing$Loan_Status)
dt_cm$table## Reference
## Prediction N Y
## N 29 21
## Y 28 105
performance_df <- data.frame(Model = NULL, Accuracy = NULL, Sensitivity = NULL, Specificity = NULL)
perf_dt <- data.frame(Model = "Base Decision Tree", Accuracy = dt_cm$overall[1], Sensitivity = dt_cm$byClass[1], Specificity = dt_cm$byClass[2])
performance_df <- rbind(performance_df, perf_dt)
perf_dtWe now try to prune the tree based on cross-validated error rates. In other words, we will try to find a Cost Complexity pramater which has the lowest cross-validation error rate and then supply this value back into growing the decision tree, but stop growing past a certain nodes where the cross-validation error rate is not minimal
From the below we see that the error rate is minimized when the tree is only at 2 levels. This indicates the Decision Tree is provides the most accurate predictions at 2 levels, which is just 1 split on Credit History. This is an unrealistic tree and for practical reasons, we chose to find the cost complexity parameter where pruning the tree will result in atleast 2 splits
#cost complexity parameter where number of splits is greater than 1 AND the cross-validation error is minimum
#min_err <- loan_dt$cptable[which.min(loan_dt$cptable[,"xerror"]),"CP"]
cost_complexity_dt <- data.frame(printcp(loan_dt))##
## Classification tree:
## rpart(formula = Loan_Status ~ ., data = training, method = "class",
## parms = list(split = "information"), control = rpart.control(cp = 0))
##
## Variables actually used in tree construction:
## [1] Credit_History Dependents Loan_Amount_Term Property_Area
## [5] TotalIncome
##
## Root node error: 135/431 = 0.31323
##
## n= 431
##
## CP nsplit rel error xerror xstd
## 1 0.4518519 0 1.00000 1.00000 0.071325
## 2 0.0296296 1 0.54815 0.54815 0.057993
## 3 0.0074074 2 0.51852 0.57037 0.058908
## 4 0.0052910 3 0.51111 0.60741 0.060360
## 5 0.0000000 10 0.47407 0.76296 0.065582
min_err <- (cost_complexity_dt %>% filter(nsplit > 1) %>% slice(which.min(xerror)))$CP
cat("Minimum Error: ", min_err)## Minimum Error: 0.007407407
Prune tree based on CP value that has lowest cross validation error rate with atleaset 2 splits. The resulting tree is 3 levels levels and the most significant predictors are Credit History and Total Income
Accuracy of pruned tree
loan_dt_prune_pred <- predict(loan_dt_prune, testing, type = "class")
dt_pruned_cm <- confusionMatrix(loan_dt_prune_pred, testing$Loan_Status)
dt_pruned_cm$table## Reference
## Prediction N Y
## N 25 7
## Y 32 119
perf_dt_pruned <- data.frame(Model = "Pruned Decision Tree", Accuracy = dt_pruned_cm$overall[1], Sensitivity = dt_pruned_cm$byClass[1], Specificity = dt_pruned_cm$byClass[2])
performance_df <- rbind(performance_df, perf_dt_pruned)
perf_dt_prunedThe Random Forest model is similar to Decision Tree model in terms of how it constructs a tree. The difference is that the Random Forest model is a collection (ensamble) of Decision Trees. Each of these trees in the forest are constructed bootstrapped observations and randomized subset of features. Each tree in the forest then classifies a given observation and then the majority of the classification class is determined as the predicted class for that given observation
To build an optimal forest there are 2 parameters that we can tune:
For classification datasets, the generally recommended subset of variables recommended is the squareroot of the total number of predictor variables. In our dataset this is this is 3, since we have 10 total predictor variables
Form the results below we see that 2 random variables per tree in the forest generates the lowest Out of Bag error. In quick summary, when using bootstrapped samples, about a third of the observations do not end up getting used in construction of the trees, and the model’s prediction on these new (to the model) observations is the Out of Bag Error
Therefore we will set the mtry parameter to the number of variables that generate the lowest Out of Bag error rate
## mtry = 3 OOB error = 17.4%
## Searching left ...
## mtry = 2 OOB error = 17.4%
## 0 0.05
## Searching right ...
## mtry = 6 OOB error = 18.56%
## -0.06666667 0.05
val_opt <- min_tree_var [,"mtry"][which.min(min_tree_var [,"OOBError"])]
loan_rf <- randomForest(Loan_Status ~., data = training, importance = TRUE, ntree=500, mtry = val_opt)
loan_rf##
## Call:
## randomForest(formula = Loan_Status ~ ., data = training, importance = TRUE, ntree = 500, mtry = val_opt)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.71%
## Confusion matrix:
## N Y class.error
## N 72 63 0.46666667
## Y 9 287 0.03040541
Next we want to see if we have generated enough trees so that the Out Of Bag (OOB Error) error rates are minimum. From the below we see that the OOB error rate is decreasing with 1-20 trees, and rate stabilizes that at around 100 trees. Therefore for this dataset 500 trees is more than enough to minimize the OOB Error rate
plot(loan_rf, col = c("black", "red", "green"), lty = c(1, 1, 1), main = "Predicted Loan Error Rates")
legend("right", c(OOB = "Out of Bag Error", "Denied", "Appoved"), col = c("black", "red", "green"), lty = c(1, 1, 1))An area of interest like Decision Trees is Variable importance, which is a scale of a given variable’s predictive power and is taken by taking the average purity of child nodes that the split causes across all trees in the forest. Variables with large increases in purity are considered more “important”
Below is a table of the predictor variables and their various importance factors. Some definitions to help understand the below table:
Predictors from the bootstrapped sample set that will give the highest purity are selected first. Credit History looks to be one of the most important predictors because:
Looking at the multi-way importance plot we see that Credit History:
Random Forest Accuracy
loan_rf_pred <- predict(loan_rf, testing, type = "class")
rf_cm <- confusionMatrix(loan_rf_pred, testing$Loan_Status)
rf_cm$table## Reference
## Prediction N Y
## N 24 4
## Y 33 122
perf_rf <- data.frame(Model = "Random Forest", Accuracy = rf_cm$overall[1], Sensitivity = rf_cm$byClass[1], Specificity = rf_cm$byClass[2])
performance_df <- rbind(performance_df, perf_rf)
perf_rfThis is another kind on ensemble algorithm in which several model are combined, but instead of running the model is parallel and then combining results, here model are run in series, with each output serving as an input to the next. With each new run of the classifier, the results are improved. Also, each classifiers performance can actually be rather low (shallow trees can be used for example), and with feeding the results to a new classifier, the overall effect is an improved performance.
Run gradient boost model
set.seed(123)
loan_boost <- gbm(Loan_Status ~., data = training,
n.trees=500,
interaction.depth=4,
shrinkage=0.01,
bag.fraction=0.5,
distribution="multinomial",
verbose=FALSE,
cv.folds = 5,
n.cores=2
)## Warning: Setting `distribution = "multinomial"` is ill-advised as it is
## currently broken. It exists only for backwards compatibility. Use at your own
## risk.
Check the performance of the model, also check the optimum number of trees for “cv” method. Looks like optimum numbers of tress is 111. We will run our predictions with that number.
## [1] 116
From the model we see that features Credit_History, TotalIncome and LoanAmount features are most important variables of the model and Gender is the least(which is good).
labels <- colnames(pred)[apply(pred, 1, which.max)]
result <- data.frame(testing$Loan_Status, labels)
gb_cm <- confusionMatrix(testing$Loan_Status, as.factor(labels))
gb_cm$table## Reference
## Prediction N Y
## N 23 34
## Y 4 122
So with gradient boosting we got 80.32% accuracy.
perf_gb <- data.frame(Model = "Gradient Boosting", Accuracy = gb_cm$overall[1], Sensitivity = gb_cm$byClass[1], Specificity = gb_cm$byClass[2])
performance_df <- rbind(performance_df, perf_gb)
perf_gb